Massachusetts General Hospital

Collaborative effort to investigate the plasma proteomic signatures of COVID-19 positive patients

Data provided by the MGH Emergency Department COVID-19 Cohort (Filbin, Goldberg, Hacohen) with Olink Proteomics. https://www.olink.com/mgh-covid-study/

Initial analysis

Plasma proteomics reveals tissue-specific cell death and mediators of cell-cell interactions in severe COVID-19 patients

https://www.biorxiv.org/content/10.1101/2020.11.02.365536v2

By analyzing several thousand plasma proteins in 306 COVID-19 patients and 78 symptomatic controls over serial timepoints using two complementary approaches, we uncover COVID-19 host immune and non-immune proteins not previously linked to this disease. Integration of plasma proteomics with nine published scRNAseq datasets shows that SARS-CoV-2 infection upregulates monocyte/macrophage, plasmablast, and T cell effector proteins. By comparing patients who died to severely ill patients who survived, we identify dynamic immunomodulatory and tissue-associated proteins associated with survival, providing insights into which host responses are beneficial and which are detrimental to survival.

Study design

Proposed Analyses

We reproduce the results obtained in Filbin et al. (2020) using the Olink data (timepoint D0)
We compare COVID-19 positive patients to identify biomarkers of severity (timepoint D0)

Note: Severity groups are defined from the WHO scores assigned to each patient (the 'WHO 0', 'WHO 3', 'WHO 7', 'WHO 28' and 'WHO max' columns of the clinical table).

[1]:

import os

from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz

from ckg.graphdb_connector import connector
from ckg.graphdb_builder import builder_utils

from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)

c:\users\sande\.conda\envs\pip_rev\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.11, the latest is 0.3.12.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  **kwargs

WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.

Reading proteomics and clinical data

[2]:

#### Access: Data provided by the MGH Emergency Department COVID-19 Cohort
#### (Filbin, Goldberg, Hacohen) with Olink Proteomics. https://www.olink.com/mgh-covid-study/

olink_file = '../../assets/MGH_COVID_OLINK_NPX.txt'
olink_clin_file = '../../assets/MGH_COVID_Clinical_Info.txt'

Proteomics data

[3]:

olink_data = builder_utils.readDataFromCSV(olink_file, sep=';', comment='#')

[4]:

olink_data.head()

[4]:

	SampleID	subject_id	Timepoint	Index	OlinkID	UniProt	Assay	MissingFreq	Panel	Panel_Version	PlateID	QC_Warning	LOD	NPX
0	1_D0	1.0	D0	56	OID21311	Q9BTE6	AARSD1	0.04	ONCOLOGY	1	20200772_Plate5_NEURO_ONC	Pass	0.7204	3.2277
1	1_D0	1.0	D0	56	OID20921	Q96IU4	ABHD14B	0.06	NEUROLOGY	1	20200772_Plate5_NEURO_ONC	Pass	0.5696	0.7205
2	1_D0	1.0	D0	56	OID21280	P00519	ABL1	0.04	ONCOLOGY	1	20200772_Plate5_NEURO_ONC	Pass	0.5313	2.6293
3	1_D0	1.0	D0	56	OID21269	P09110	ACAA1	0.12	ONCOLOGY	1	20200772_Plate5_NEURO_ONC	Pass	2.0588	3.2670
4	1_D0	1.0	D0	56	OID20159	P16112	ACAN	0.04	CARDIOMETABOLIC	1	20200772_Plate5_CARDIO_INF	Pass	1.1623	2.0308

QC and data exploration

[5]:

print("Number of proteins:", len(olink_data['UniProt'].unique()))

Number of proteins: 1420

[6]:

olink_data.shape

[6]:

(1148916, 14)

Remove measurements with QC warnings

[7]:

olink_data['QC_Warning'].unique()

[7]:

array(['Pass', 'Warning'], dtype=object)

[8]:

olink_data.groupby('QC_Warning').count()['SampleID']

[8]:

QC_Warning
Pass       1136634
Warning      12282
Name: SampleID, dtype: int64

[9]:

olink_data = olink_data[olink_data['QC_Warning'] == 'Pass']

[10]:

olink_data.shape

[10]:

(1136634, 14)

[11]:

print("Number of proteins that passed Olink QC:", len(olink_data['UniProt'].unique()))

Number of proteins that passed Olink QC: 1420

[12]:

print("Total number of patients:", len(olink_data['subject_id'].unique()))

Total number of patients: 384

[13]:

print("Total number of samples:", len(olink_data['SampleID'].unique()))

Total number of samples: 786

[14]:

olink_data.shape

[14]:

(1136634, 14)

Remove rows with missing subject id

[15]:

olink_data = olink_data.dropna(subset=['subject_id'])

[16]:

olink_data['subject_id'] = olink_data['subject_id'].astype('int').astype('str')

[17]:

olink_data.shape

[17]:

(1108054, 14)
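The clean-up steps above (QC filter, dropping rows without a subject, casting the id) can be reproduced on a toy table. This is a minimal sketch with invented values; only the column names come from the Olink table. The cast through int matters because str(1.0) is '1.0', which would not match a clinical subject_id that becomes '1' once cast to string.

```python
import pandas as pd

# Toy long-format table; values are made up for illustration only.
toy = pd.DataFrame({
    "SampleID": ["1_D0", "1_D0", "2_D0", "no_subject"],
    "subject_id": [1.0, 1.0, 2.0, None],   # None becomes NaN in a float column
    "QC_Warning": ["Pass", "Warning", "Pass", "Pass"],
    "NPX": [3.2, 0.7, 2.6, 1.1],
})

# 1) keep only measurements that passed QC
passed = toy[toy["QC_Warning"] == "Pass"]

# 2) drop rows that cannot be linked to a subject
with_subject = passed.dropna(subset=["subject_id"]).copy()

# 3) float -> int -> str so that 1.0 becomes "1" rather than "1.0"
with_subject["subject_id"] = with_subject["subject_id"].astype(int).astype(str)

print(with_subject)
```

Filtering on QC_Warning removes individual measurements, not whole proteins, which is consistent with the protein count staying at 1420 after the filter.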

Merge protein names and identifiers

[18]:

olink_data['identifier'] = olink_data['Assay'] +"~"+olink_data['UniProt']
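The combined key keeps the readable assay name next to the UniProt accession. A small sketch of building and splitting such a key is given here; the helper names are hypothetical and not part of CKG.

```python
def make_identifier(assay, uniprot, sep="~"):
    """Join an assay name and a UniProt accession into one key."""
    return f"{assay}{sep}{uniprot}"


def split_identifier(identifier, sep="~"):
    """Split a key back into (assay, uniprot).

    rsplit with maxsplit=1 is used so that only the last separator
    is treated as the boundary before the accession.
    """
    assay, uniprot = identifier.rsplit(sep, 1)
    return assay, uniprot


key = make_identifier("ACE2", "Q9BYF1")
print(key)                    # ACE2~Q9BYF1
print(split_identifier(key))  # ('ACE2', 'Q9BYF1')
```

The same name~accession pattern is rebuilt on the graph side in the functional enrichment query (p.name+'~'+p.id), which is what allows the two sources to be matched.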

[19]:

olink_data.head()

[19]:

	SampleID	subject_id	Timepoint	Index	OlinkID	UniProt	Assay	MissingFreq	Panel	Panel_Version	PlateID	QC_Warning	LOD	NPX	identifier
0	1_D0	1	D0	56	OID21311	Q9BTE6	AARSD1	0.04	ONCOLOGY	1	20200772_Plate5_NEURO_ONC	Pass	0.7204	3.2277	AARSD1~Q9BTE6
1	1_D0	1	D0	56	OID20921	Q96IU4	ABHD14B	0.06	NEUROLOGY	1	20200772_Plate5_NEURO_ONC	Pass	0.5696	0.7205	ABHD14B~Q96IU4
2	1_D0	1	D0	56	OID21280	P00519	ABL1	0.04	ONCOLOGY	1	20200772_Plate5_NEURO_ONC	Pass	0.5313	2.6293	ABL1~P00519
3	1_D0	1	D0	56	OID21269	P09110	ACAA1	0.12	ONCOLOGY	1	20200772_Plate5_NEURO_ONC	Pass	2.0588	3.2670	ACAA1~P09110
4	1_D0	1	D0	56	OID20159	P16112	ACAN	0.04	CARDIOMETABOLIC	1	20200772_Plate5_CARDIO_INF	Pass	1.1623	2.0308	ACAN~P16112

Clinical data

[20]:

olink_clin_data = builder_utils.readDataFromCSV(olink_clin_file, sep=';', comment='#')

[21]:

olink_clin_data.head()

[21]:

	subject_id	COVID	Age cat	BMI cat	LUNG	DIABETES	HTN	...	crp_3_cat	ddimer_3_cat	ldh_3_cat	abs_neut_7_cat	abs_lymph_7_cat	abs_mono_7_cat	creat_7_cat	crp_7_cat	ddimer_7_cat	ldh_7_cat
0	1	1	1	4	0	0	0	...	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2	1	2	2	0	0	0	...	2.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	1	3	4	1	0	0	...	3.0	2.0	3.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	4	1	1	2	0	0	0	...	2.0	2.0	3.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	5	1	3	3	0	1	1	...	5.0	3.0	NaN	3.0	5.0	3.0	1.0	4.0	4.0	3.0

5 rows × 40 columns

[22]:

plot = viz.get_boxplot_grid(olink_clin_data, identifier="clin_vars", args={"title": "Clinical variables",
                                                                           "x":"COVID",
                                                                           "y":"Age cat",
                                                                           "color":"COVID", "width":600})
iplot(plot.figure)

[23]:

olink_clin_data['subject_id'] = olink_clin_data['subject_id'].astype('str')

[24]:

olink_clin_data.columns.tolist()

[24]:

['subject_id',
 'COVID',
 'Age cat',
 'BMI cat',
 'HEART',
 'LUNG',
 'KIDNEY',
 'DIABETES',
 'HTN',
 'IMMUNO',
 'Resp_Symp',
 'Fever_Sympt',
 'GI_Symp',
 'WHO 0',
 'WHO 3',
 'WHO 7',
 'WHO 28',
 'WHO max',
 'abs_neut_0_cat',
 'abs_lymph_0_cat',
 'abs_mono_0_cat',
 'creat_0_cat',
 'crp_0_cat',
 'ddimer_0_cat',
 'ldh_0_cat',
 'Trop_72h',
 'abs_neut_3_cat',
 'abs_lymph_3_cat',
 'abs_mono_3_cat',
 'creat_3_cat',
 'crp_3_cat',
 'ddimer_3_cat',
 'ldh_3_cat',
 'abs_neut_7_cat',
 'abs_lymph_7_cat',
 'abs_mono_7_cat',
 'creat_7_cat',
 'crp_7_cat',
 'ddimer_7_cat',
 'ldh_7_cat']

Merging both datasets

[25]:

data = olink_data.set_index('subject_id').join(olink_clin_data.set_index('subject_id')).reset_index()

[26]:

data.head()

[26]:

	subject_id	SampleID	Timepoint	Index	OlinkID	UniProt	Assay	MissingFreq	Panel	Panel_Version	...	crp_3_cat	ddimer_3_cat	ldh_3_cat	abs_neut_7_cat	abs_lymph_7_cat	abs_mono_7_cat	creat_7_cat	crp_7_cat	ddimer_7_cat	ldh_7_cat
0	1	1_D0	D0	56	OID21311	Q9BTE6	AARSD1	0.04	ONCOLOGY	1	...	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1	1_D0	D0	56	OID20921	Q96IU4	ABHD14B	0.06	NEUROLOGY	1	...	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1	1_D0	D0	56	OID21280	P00519	ABL1	0.04	ONCOLOGY	1	...	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1	1_D0	D0	56	OID21269	P09110	ACAA1	0.12	ONCOLOGY	1	...	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1	1_D0	D0	56	OID20159	P16112	ACAN	0.04	CARDIOMETABOLIC	1	...	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 54 columns

[27]:

data.shape

[27]:

(1108054, 54)
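DataFrame.join defaults to a left join on the index, so every proteomics row is kept and the clinical columns are repeated for each measurement of a subject. That is consistent with the row count staying at 1108054 after the merge. The sketch below uses invented toy tables and adds a check for subjects without clinical data; the check is a suggestion, not part of the original analysis.

```python
import pandas as pd

# Toy tables; values are made up for illustration only.
proteins = pd.DataFrame({
    "subject_id": ["1", "1", "2", "3"],
    "identifier": ["ACE2~Q9BYF1", "ABL1~P00519", "ACE2~Q9BYF1", "ACE2~Q9BYF1"],
    "NPX": [1.4, 2.6, 0.5, 0.9],
})
clinical = pd.DataFrame({
    "subject_id": ["1", "2"],
    "COVID": [1, 0],
})

# Same pattern as in the notebook: left join on the subject_id index.
merged = (
    proteins.set_index("subject_id")
    .join(clinical.set_index("subject_id"))
    .reset_index()
)

# Subjects present in the proteomics table but missing from the clinical table.
unmatched = merged.loc[merged["COVID"].isna(), "subject_id"].unique().tolist()

print(merged)
print("Subjects without clinical data:", unmatched)
```

Both tables must hold subject_id with the same dtype (strings here) for the keys to match.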

1. COVID-19 positive vs negative

[28]:

data = data[data['Timepoint'] == "D0"]

[29]:

data.shape

[29]:

(542015, 54)

[30]:

print("Total number of patients at D0:", len(data['subject_id'].unique()))

Total number of patients at D0: 383

[31]:

df = data[['subject_id', 'SampleID', 'identifier', 'WHO max', 'WHO 0', 'NPX', 'Age cat', 'COVID', 'BMI cat', 'Timepoint', 'HEART']]

[32]:

print("Total number of patients at D0:", len(data['subject_id'].unique()))

Total number of patients at D0: 383

[33]:

df_wide = analytics.transform_into_wide_format(df, index=['SampleID', 'subject_id'], columns=['identifier'], values='NPX', extra=['WHO max', 'WHO 0', 'Age cat', 'COVID', 'BMI cat','HEART'])

[34]:

df_wide.head()

[34]:

	SampleID	subject_id	AARSD1~Q9BTE6	ABHD14B~Q96IU4	ABL1~P00519	ACAA1~P09110	ACAN~P16112	ACE2~Q9BYF1	ACOX1~Q15067	ACP5~P13686	...	YES1~P07947	YTHDF3~Q7Z739	ZBTB16~Q05516	ZBTB17~Q13105	WHO max	WHO 0	Age cat	COVID	BMI cat	HEART
0	100_D0	100	3.2374	1.9080	3.4624	3.7201	2.8415	1.4208	0.3747	4.6433	...	5.0236	0.4053	2.9979	1.2529	4	4	1	1	4	0
1	101_D0	101	2.1038	1.1206	2.0700	2.0108	2.6067	0.4767	-0.0337	4.6615	...	3.3785	0.2607	1.9315	1.1605	4	4	2	1	1	0
2	102_D0	102	2.7613	1.3493	2.8657	2.8731	2.0461	0.3986	0.2983	4.4803	...	4.6267	0.4771	1.7620	0.6840	1	1	5	1	3	0
3	103_D0	103	2.6384	0.9447	1.4727	2.6550	1.6613	0.6749	0.1486	3.7081	...	2.9520	0.4808	0.9480	0.4823	6	6	1	1	5	0
4	104_D0	104	5.3336	1.5130	2.0466	1.8627	2.4433	1.5736	-0.1446	3.2100	...	2.3919	0.0496	1.3069	1.1057	6	6	5	0	3	1

5 rows × 1428 columns

[35]:

df_wide.shape

[35]:

(383, 1428)
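The implementation of analytics.transform_into_wide_format is not shown in this notebook. Assuming it pivots the long table to one row per sample and then attaches the extra columns, a pandas equivalent could look like the sketch below (toy values, hypothetical variable names). The resulting shape matches the arithmetic above: 1420 protein columns + 2 index columns + 6 extra columns = 1428.

```python
import pandas as pd

# Toy long-format table; values are made up for illustration only.
long_df = pd.DataFrame({
    "SampleID": ["1_D0", "1_D0", "2_D0", "2_D0"],
    "subject_id": ["1", "1", "2", "2"],
    "identifier": ["ACE2~Q9BYF1", "ABL1~P00519", "ACE2~Q9BYF1", "ABL1~P00519"],
    "NPX": [1.4, 2.6, 0.5, 2.1],
    "COVID": [1, 1, 0, 0],
})

index_cols = ["SampleID", "subject_id"]

# One row per sample, one column per protein identifier.
wide = long_df.pivot_table(
    index=index_cols, columns="identifier", values="NPX", aggfunc="first"
).reset_index()

# Sample-level annotations are constant within a sample, so deduplicate them
# and merge them back onto the wide table.
extra = long_df[index_cols + ["COVID"]].drop_duplicates()
wide = wide.merge(extra, on=index_cols)
wide.columns.name = None

print(wide)
```

Proteins that were not measured for a sample show up as NaN after the pivot, which is why df_wide.describe() reports counts below 383 for some columns.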

[36]:

df_wide.describe()

[36]:

	AARSD1~Q9BTE6	ABHD14B~Q96IU4	ABL1~P00519	ACAA1~P09110	ACAN~P16112	ACE2~Q9BYF1	ACOX1~Q15067	ACP5~P13686	ACP6~Q9NPH0	ACTA2~P62736	...	YES1~P07947	YTHDF3~Q7Z739	ZBTB16~Q05516	ZBTB17~Q13105	WHO max	WHO 0	Age cat	COVID	BMI cat	HEART
count	380.000000	380.000000	380.000000	380.000000	378.000000	374.000000	378.000000	383.000000	383.000000	374.000000	...	380.000000	378.000000	380.000000	374.000000	383.000000	383.000000	383.000000	383.000000	383.000000	383.000000
mean	3.532293	1.842658	2.760268	3.463129	2.477235	1.468916	0.137929	4.213379	3.877484	0.186726	...	3.582903	0.771492	1.751017	1.423486	3.498695	3.832898	3.219321	0.796345	2.480418	0.185379
std	1.400306	0.953734	0.991535	1.378123	0.566340	0.933315	0.371251	0.664486	0.799831	0.902741	...	1.538310	0.839493	0.975425	0.676457	1.510651	1.254631	1.208150	0.403242	1.152738	0.389113
min	0.395300	-0.416000	0.722100	0.541900	0.943900	-0.304900	-1.231400	2.751800	-0.470600	-1.322600	...	0.883800	-1.478300	0.176700	-0.153700	1.000000	1.000000	1.000000	0.000000	0.000000	0.000000
25%	2.523550	1.208275	2.030875	2.616600	2.053875	0.855425	-0.075375	3.718600	3.374700	-0.405800	...	2.253875	0.206750	1.057350	0.978850	2.000000	2.000000	2.000000	1.000000	2.000000	0.000000
50%	3.227700	1.663900	2.676800	3.310600	2.500100	1.253250	0.145550	4.222300	3.888400	-0.038550	...	3.395650	0.719950	1.520150	1.330750	4.000000	4.000000	3.000000	1.000000	2.000000	0.000000
75%	4.082625	2.237375	3.296025	4.087600	2.840300	1.855025	0.374600	4.653950	4.314900	0.432250	...	4.673750	1.309325	2.259025	1.771600	4.000000	5.000000	4.000000	1.000000	3.000000	0.000000
max	8.334200	5.339900	6.507100	9.427600	4.161700	6.196200	1.760600	6.656400	7.554500	4.327900	...	8.330600	4.298200	5.154800	4.485200	6.000000	6.000000	5.000000	1.000000	5.000000	1.000000

8 rows × 1426 columns

[37]:

result = analytics.run_correlation(df_wide, alpha=0.05, subject='subject_id', group="COVID", method="spearman", correction='fdr_bh')

[38]:

net = viz.get_network(result, identifier='corr_net', args={'source':'node1', 'target':'node2',
                                                     'weight':'weight', 'values':'weight',
                                                           'title':'Correlation Network',
                                                        'color_weight': True, 'node_size':'degree',
                                                           'cutoff': 0.5, 'cutoff_abs':True})

[39]:

viz.visualize_notebook_network(net['notebook'])
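For reference, Spearman correlation is Pearson correlation computed on ranks. The following pure-Python sketch shows the coefficient itself and an absolute cutoff of 0.5, which is how the 'cutoff' and 'cutoff_abs' arguments are assumed to act on edges here. It omits the p-values and FDR correction that analytics.run_correlation also handles, and all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def keep_edge(rho, cutoff=0.5):
    """Keep an edge when the absolute correlation reaches the cutoff."""
    return abs(rho) >= cutoff


rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])  # monotonic, non-linear
print(rho, keep_edge(rho))
```

Because only ranks are used, a monotonic but non-linear relationship between two proteins still yields a coefficient of 1 (up to floating-point error).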

[40]:

df_wide.to_csv('olink_data_NPX_values_WHO_max.tsv', sep='\t', index=False, header=True, doublequote=False)

Imputation using a mixed strategy: KNN imputation for proteins with at least 60% valid values (cutoff=0.6), probabilistic minimum imputation otherwise.
Imputation is done per group, with groups defined by COVID status (group='COVID').

[41]:

df_wide = analytics.imputation_mixed_norm_KNN(df_wide, index_cols=['WHO 0', 'WHO max', 'SampleID', 'subject_id', 'Age cat', 'COVID', 'BMI cat', 'HEART'], shift=1.8, nstd=0.3, group='COVID', cutoff=0.6)
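To make the parameters concrete, here is a sketch of the decision rule and of the fallback draw only. It assumes that 'shift' and 'nstd' have their usual meaning in probabilistic minimum imputation (a normal distribution shifted down by shift standard deviations and narrowed to nstd standard deviations). CKG's actual implementation may differ in detail, and the KNN step itself is not reproduced. Function names are hypothetical.

```python
import random
import statistics


def choose_strategy(values, cutoff=0.6):
    """Pick 'knn' when the fraction of observed values reaches the cutoff."""
    observed = [v for v in values if v is not None]
    return "knn" if len(observed) / len(values) >= cutoff else "downshifted_normal"


def impute_downshifted_normal(values, shift=1.8, nstd=0.3, seed=0):
    """Fill missing entries with draws from a down-shifted, narrowed normal."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    std = statistics.stdev(observed)
    return [
        v if v is not None else rng.gauss(mean - shift * std, nstd * std)
        for v in values
    ]


well_covered = [2.1, 2.4, None, 2.0, 2.6]  # 80% observed
sparse = [2.1, None, None, None, 2.6]      # 40% observed

print(choose_strategy(well_covered))
print(choose_strategy(sparse))
filled = impute_downshifted_normal(sparse)
print(filled)
```

The down-shift encodes the assumption that values are missing mostly because they fall below the detection limit, so they are replaced with low values rather than with the mean.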

[42]:

df_wide.head()

[42]:

								AARSD1~Q9BTE6	ABHD14B~Q96IU4	ABL1~P00519	ACAA1~P09110	ACAN~P16112	ACE2~Q9BYF1	ACOX1~Q15067	ACP5~P13686	ACP6~Q9NPH0	ACTA2~P62736	...	WISP2~O76076	WNT9A~O14904	WWP2~O00308	XCL1~P47992	XG~P55808	XPNPEP2~O43895	YES1~P07947	YTHDF3~Q7Z739	ZBTB16~Q05516	ZBTB17~Q13105
WHO 0	WHO max	SampleID	subject_id	Age cat	COVID	BMI cat	HEART
4	4	100_D0	100	1	1	4	0	3.2374	1.9080	3.4624	3.7201	2.8415	1.4208	0.3747	4.6433	4.0392	-0.4904	...	4.4080	0.1052	5.0765	2.5597	1.7649	5.1321	5.0236	0.4053	2.9979	1.2529
4	4	101_D0	101	2	1	1	0	2.1038	1.1206	2.0700	2.0108	2.6067	0.4767	-0.0337	4.6615	3.4319	-0.3363	...	3.8144	0.4324	3.8148	2.0496	1.4435	4.4233	3.3785	0.2607	1.9315	1.1605
1	1	102_D0	102	5	1	3	0	2.7613	1.3493	2.8657	2.8731	2.0461	0.3986	0.2983	4.4803	3.1530	-0.3694	...	4.6518	0.4650	4.7405	1.3652	1.6223	4.4243	4.6267	0.4771	1.7620	0.6840
6	6	103_D0	103	1	1	5	0	2.6384	0.9447	1.4727	2.6550	1.6613	0.6749	0.1486	3.7081	3.6198	-0.9642	...	4.2031	0.2012	3.9818	1.7696	1.7539	2.2632	2.9520	0.4808	0.9480	0.4823
6	6	104_D0	104	5	0	3	1	5.3336	1.5130	2.0466	1.8627	2.4433	1.5736	-0.1446	3.2100	3.8780	1.3395	...	5.3859	1.1573	4.0152	2.4125	2.0343	3.7573	2.3919	0.0496	1.3069	1.1057

5 rows × 1420 columns

[43]:

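# After imputation the sample annotations (SampleID, subject_id, COVID, ...) live in the index,
# so the index is written out as well; index=False would drop them from the file.
df_wide.to_csv('olink_data_NPX_values_imputed_KNN.tsv', sep='\t', index=True, header=True, doublequote=False)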

[44]:

df_wide = df_wide.reset_index()

[45]:

df_wide['COVID'] = ['COVID-19 positive' if c else 'COVID-19 negative' for c in df_wide['COVID']]

[46]:

pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'Age cat', 'WHO max', 'BMI cat', 'HEART'], group='COVID', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data'

[47]:

figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)

[48]:

pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'Age cat', 'COVID', 'BMI cat', 'HEART', 'WHO max'], group='WHO 0', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data coloured by WHO score at timepoint D0'

[49]:

figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)

[50]:

pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'Age cat', 'COVID', 'BMI cat', 'HEART', 'WHO 0'], group='WHO max', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data coloured by maximum WHO score'

[51]:

figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)

[52]:

pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'WHO max', 'COVID', 'BMI cat', 'HEART'], group='Age cat', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 250
args['loadings'] = 15
args['title'] = 'Olink data coloured by age category'

[53]:

figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)

[54]:

pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'WHO max', 'COVID', 'BMI cat'], group='HEART', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 350
args['loadings'] = 20
args['title'] = 'Olink data coloured by HEART'

[55]:

figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)

[56]:

pca, args = analytics.run_pca(df_wide, drop_cols=['SampleID', 'subject_id', 'WHO max', 'COVID', 'HEART'], group='BMI cat', annotation_cols=['SampleID'])
args['group'] = 'group'
args['hovering_cols'] = ['SampleID']
args['factor'] = 350
args['loadings'] = 20
args['title'] = 'Olink data coloured by BMI cat'

[57]:

figure = viz.get_pca_plot(pca, identifier='pca', args=args)
iplot(figure.figure)

Differential regulation: COVID-19 positive vs negative

[58]:

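# 'WHO 0' is dropped as well so that the severity score is not tested as if it were a protein
ancova_results = analytics.run_ancova(df_wide, covariates=['Age cat', 'HEART', 'BMI cat'], drop_cols=['SampleID', 'subject_id', 'WHO max', 'WHO 0'], subject='subject_id', group='COVID', is_logged=True)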

[59]:

volcano_plot = viz.run_volcano(ancova_results, identifier='proteomics_volcanos',
            args={'alpha':0.01, 'fc':2, 'num_annotations': 50,
                  'colorscale':'Blues', 'showscale': False,
                  'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})

for plot in volcano_plot:
    iplot(plot.figure)
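The volcano thresholds can be read as follows: NPX values are on a log2 scale, so a fold change of 2 corresponds to |log2FC| >= 1, and alpha = 0.01 corresponds to 2 on the -log10 axis. The sketch below is only an illustration of that reading. Whether CKG compares raw or adjusted p-values, and whether the comparison is strict, is an assumption here, and the function name is hypothetical.

```python
import math


def classify(log2fc, pvalue, alpha=0.01, fc=2):
    """Label a protein as 'up', 'down' or 'not significant'."""
    lfc_cutoff = math.log2(fc)  # fold change 2 -> log2 fold change 1
    if pvalue >= alpha or abs(log2fc) < lfc_cutoff:
        return "not significant"
    return "up" if log2fc > 0 else "down"


examples = [(1.5, 0.001), (-1.2, 0.001), (0.5, 0.001), (2.0, 0.05)]
for log2fc, pvalue in examples:
    y = -math.log10(pvalue)  # position on the volcano y axis
    print(f"log2FC={log2fc:+.1f}  -log10(p)={y:.2f}  ->  {classify(log2fc, pvalue)}")
```

The same fold-change threshold reappears as lfc_cutoff=1 in the enrichment call, so both steps select proteins with the same effect size.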

Functional enrichment

[60]:

go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(df_wide.columns.tolist())
annotation = connector.run_query(go_terms_query)
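The query relies on the fact that the Python representation of a list of strings, such as ['A~1', 'B~2'], is also a valid Cypher list literal. The sketch below shows the formatted string on a toy column list and, as a suggestion rather than part of the original analysis, keeps only the protein identifiers (names containing '~') so that annotation columns such as 'SampleID' are not sent to the database.

```python
# Toy column list; in the notebook this would be df_wide.columns.tolist().
columns = ["SampleID", "subject_id", "ACE2~Q9BYF1", "ABL1~P00519", "COVID"]

template = (
    "MATCH (p:Protein)-[]-(bp:Biological_process) "
    "WHERE (p.name+'~'+p.id) IN {} "
    "RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
)

# Keep only columns that follow the name~accession pattern.
protein_ids = [c for c in columns if "~" in c]

query = template.format(protein_ids)
print(query)
```

String formatting works here because the identifiers contain no quote characters; an identifier with a single quote in it would break the generated query.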

[61]:

annotation.head()

[61]:

	annotation	identifier
0	mitochondrial genome maintenance	AKT3~Q9Y243
1	mitochondrial genome maintenance	TYMP~P19971
2	regulation of DNA recombination	IL7R~P16871
3	cell wall mannoprotein biosynthetic process	MPI~P34949
4	very long-chain fatty acid metabolic process	ACAA1~P09110

[62]:

annotation.shape

[62]:

(19230, 2)

[63]:

enrichment_results = analytics.run_up_down_regulation_enrichment(ancova_results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
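As a reminder of what method='fisher' with correction='fdr_bh' computes, the sketch below implements a one-sided Fisher exact test (the upper tail of the hypergeometric distribution) and the Benjamini-Hochberg adjustment in plain Python. It is a didactic sketch with hypothetical function names; CKG may use a two-sided test or different background definitions.

```python
from math import comb


def fisher_right_tail(hits, selected, annotated, background):
    """P(X >= hits) when drawing `selected` proteins from `background`,
    of which `annotated` carry the term (hypergeometric upper tail)."""
    total = comb(background, selected)
    upper = min(selected, annotated)
    return sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(hits, upper + 1)
    ) / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 5 regulated proteins, all 5 annotated with a term that covers 5 of 10 proteins.
p = fisher_right_tail(hits=5, selected=5, annotated=5, background=10)
print(p)

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With 19230 protein-term pairs in the annotation table, many terms are tested at once, which is why the multiple-testing correction matters more here than in a single comparison.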

[64]:

figures = viz.get_enrichment_plots(enrichment_results, identifier='enrichment', args={'width':1800})
for fig in figures:
    iplot(fig.figure)

2. COVID-19 positive patients: biomarkers of severity

[65]:

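# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned in the next cell
df_wide = df_wide[df_wide['COVID'] == 'COVID-19 positive'].copy()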

[66]:

df_wide['WHO 0'] = df_wide['WHO 0'].astype('str')
df_wide['WHO max'] = df_wide['WHO max'].astype('str')

[67]:

df_wide.shape

[67]:

(305, 1428)

[68]:

ancova_results = analytics.run_ancova(df_wide, covariates=['Age cat', 'HEART', 'BMI cat'], drop_cols=['SampleID', 'subject_id', 'COVID', 'WHO 0'], subject='subject_id', group='WHO max', is_logged=True)

[69]:

volcano_plot = viz.run_volcano(ancova_results, identifier='proteomics_volcanos',
            args={'alpha':0.01, 'fc':2, 'num_annotations': 50,
                  'colorscale':'Blues', 'showscale': False,
                  'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})

for plot in volcano_plot:
    iplot(plot.figure)

[70]:

go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(df_wide.columns.tolist())
annotation = connector.run_query(go_terms_query)

[71]:

annotation.head()

[71]:

	annotation	identifier
0	mitochondrial genome maintenance	AKT3~Q9Y243
1	mitochondrial genome maintenance	TYMP~P19971
2	regulation of DNA recombination	IL7R~P16871
3	cell wall mannoprotein biosynthetic process	MPI~P34949
4	very long-chain fatty acid metabolic process	ACAA1~P09110

[72]:

annotation.shape

[72]:

(19230, 2)

[73]:

enrichment_results = analytics.run_up_down_regulation_enrichment(ancova_results, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)

c:\users\sande\.conda\envs\pip_rev\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.